An Inhomogeneous Poisson Process (IPP) is a statistical model used for events that occur randomly over space or time. Unlike a homogeneous Poisson process, whose rate is constant, the rate of event occurrence in an IPP can vary with location or time.
IPP is often used in presence-only problems to model the intensity of events across different locations or times. The intensity can be written as \(\lambda(x) = N \cdot f(x)\), where \(\lambda(x)\) is the intensity function, \(N\) is the expected total number of events, and \(f(x)\) is the probability density of event locations.
By fitting an IPP to presence-only data, we estimate the underlying intensity function, which tells us how the rate of event occurrence changes across different locations or times. For example, we might hypothesize that the presence of the Cedar Waxwing is influenced by factors like canopy cover, land cover type, and temperature. We use the presence-only data to fit the IPP model, estimating the parameters of \(\lambda(x)\) that best fit the observed data.
The intensity function \(\lambda(x)\) in an IPP is typically defined as a function of some parameters \(\theta\) and the environmental variables at location \(x\). For example, we might have \(\lambda(x; \theta) = \exp(\theta_1 \cdot \text{canopy} + \theta_2 \cdot \text{land\_cover} + \dots)\), where \(\text{canopy}\), \(\text{land\_cover}\), etc. are the environmental variables, and \(\theta_1\), \(\theta_2\), etc. are the parameters to be estimated.
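To make the log-linear form concrete, here is a minimal sketch that evaluates \(\lambda(x; \theta)\) at three locations. The covariate values, the parameter values, and the helper name `log_linear_intensity` are all made up for illustration; they are not estimates from the study data.

```r
# Hypothetical covariate values at three locations (illustrative only)
env <- data.frame(
  canopy      = c(0.10, 0.55, 0.90),  # fraction of canopy cover
  temperature = c(12.0, 15.5, 18.0)   # degrees Celsius
)

# Hypothetical parameters, one per covariate (not fitted values)
theta <- c(canopy = 1.5, temperature = 0.05)

# Log-linear intensity: lambda(x; theta) = exp(theta_1 * canopy + theta_2 * temperature)
log_linear_intensity <- function(env, theta) {
  as.numeric(exp(as.matrix(env) %*% theta))
}

lambda <- log_linear_intensity(env, theta)
print(lambda)
```

Because of the exponential, the intensity is always positive, and each parameter acts multiplicatively: raising canopy cover by one unit multiplies the intensity by \(\exp(\theta_1)\).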
The likelihood of the observed data given the parameters \(\theta\) is given by:
\(L(\theta) = \left[\prod_{i=1}^{n} \lambda(x_i; \theta)\right] \exp\left(-\int \lambda(x; \theta)\, dx\right)\), where the product is over all \(n\) observed presence locations \(x_i\), and the integral is over all possible locations \(x\) in the study area. The product term captures how likely presences are at the observed locations, and the exponential term accounts for the species not being observed at any other location.
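Taking logs turns the likelihood into \(\log L(\theta) = \sum_i \log \lambda(x_i; \theta) - \int \lambda(x; \theta)\, dx\). The sketch below computes this for a toy one-dimensional study region \([0, 1]\), approximating the integral with a midpoint rule. The intensity form, the presence locations, and the function name `ipp_loglik` are assumptions chosen for illustration.

```r
# Log-likelihood of an IPP on the 1-D study region [0, 1], assuming the
# illustrative intensity lambda(x; theta) = exp(theta[1] + theta[2] * x).
ipp_loglik <- function(theta, pts, n_grid = 1000) {
  lambda <- function(x) exp(theta[1] + theta[2] * x)
  # Midpoint-rule quadrature approximates the integral of lambda over [0, 1]
  mids <- (seq_len(n_grid) - 0.5) / n_grid
  integral <- sum(lambda(mids)) / n_grid
  # log L = sum_i log lambda(x_i) - integral of lambda
  sum(log(lambda(pts))) - integral
}

# Made-up presence locations
pts <- c(0.15, 0.40, 0.62, 0.71, 0.88, 0.93)

ll <- ipp_loglik(c(0.5, 1.5), pts)
print(ll)
```

In real spatial problems the integral is usually approximated the same way, by summing the intensity over a grid of raster cells or background points.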
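The goal of maximum likelihood estimation is to find the parameters \(\theta\) that maximize this likelihood. In practice we usually maximize the log-likelihood instead, which has the same maximizer and is numerically more stable. This is typically done using numerical optimization methods, such as gradient ascent or Newton’s method.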
Once we have estimated the parameters \(\theta\), we can use them to calculate the intensity function \(\lambda(x; \theta)\) at any location \(x\). This gives us an estimate of the rate of species occurrence at that location, based on the environmental variables at that location.
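The fit-then-predict workflow can be sketched end to end on simulated data. This is a toy setup, not the model used later in the study: the study region is \([0, 1]\), the single covariate is \(x\) itself, the "true" parameters are invented, and base R's `optim()` stands in for whatever optimizer a dedicated package would use.

```r
set.seed(42)

# Assumed toy intensity: lambda(x; theta) = exp(theta[1] + theta[2] * x)
true_theta <- c(log(50), 2)
intensity <- function(x, theta) exp(theta[1] + theta[2] * x)

# Simulate presences by thinning a homogeneous process with rate lambda_max
lambda_max <- intensity(1, true_theta)  # intensity is largest at x = 1
n_candidates <- rpois(1, lambda_max)
candidates <- runif(n_candidates)
keep <- runif(n_candidates) < intensity(candidates, true_theta) / lambda_max
pts <- candidates[keep]

# Midpoint quadrature nodes for the integral of lambda over [0, 1]
mids <- (seq_len(1000) - 0.5) / 1000

# Negative log-likelihood, because optim() minimizes by default
neg_loglik <- function(theta) {
  -(sum(theta[1] + theta[2] * pts) - mean(intensity(mids, theta)))
}

# Start from a flat intensity that matches the observed number of points
start <- c(log(length(pts)), 0)
fit <- optim(start, neg_loglik, method = "BFGS")
theta_hat <- fit$par

# Estimated intensity at a few new locations
new_x <- c(0.1, 0.5, 0.9)
lambda_hat <- intensity(new_x, theta_hat)
print(theta_hat)
print(lambda_hat)
```

A useful sanity check for this kind of model is that the fitted intensity integrates to roughly the number of observed points, which follows from the likelihood equation for the intercept.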
As with any parametric statistical model, one of the biggest constraints is that there are certain assumptions that must be met in order for the model to technically be “valid”.
Entropy is a measure of uncertainty, randomness, or chaos in a set of data. In other words, it quantifies the amount of unpredictability or surprise in possible outcomes.
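For a discrete distribution with probabilities \(p_i\), Shannon entropy is \(H(p) = -\sum_i p_i \log_2 p_i\), measured in bits. A small sketch, with the helper name `shannon_entropy` chosen for illustration, shows how it tracks unpredictability:

```r
# Shannon entropy (in bits) of a discrete probability distribution
shannon_entropy <- function(p) {
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)
  p <- p[p > 0]  # by convention, 0 * log(0) contributes 0
  -sum(p * log2(p))
}

h_fair    <- shannon_entropy(c(0.5, 0.5))  # fair coin: 1 bit
h_biased  <- shannon_entropy(c(0.9, 0.1))  # biased coin: about 0.469 bits
h_certain <- shannon_entropy(c(1, 0))      # certain outcome: 0 bits
print(c(h_fair, h_biased, h_certain))
```

The fair coin is the most unpredictable of the three, so it has the highest entropy; a certain outcome carries no surprise at all.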
Maximum Entropy (MaxEnt) is a method that selects, among all probability distributions consistent with our known data, the one with the highest entropy, i.e. the most spread-out one. It’s useful in presence-only problems as it minimizes bias in estimating event occurrence based on observed data. “It agrees with everything that is known, but carefully avoids assuming anything that is not known” (Jaynes, 1990).
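The principle can be illustrated with a classic toy problem, unrelated to the species data: find the most spread-out distribution over the faces of a die when all we know is its average roll. Under a mean constraint, the maximum-entropy solution has the form \(p_i \propto \exp(\beta \cdot i)\), so the sketch below only needs to solve for \(\beta\). The function name `maxent_given_mean` is hypothetical, and this is not how MaxEnt software for species distribution modeling is implemented.

```r
# Maximum-entropy distribution over `values` subject to a known mean.
# The solution is p_i proportional to exp(beta * value_i); beta is found numerically.
maxent_given_mean <- function(values, target_mean) {
  mean_gap <- function(beta) {
    w <- exp(beta * values)
    sum(values * w) / sum(w) - target_mean
  }
  beta <- uniroot(mean_gap, interval = c(-10, 10), tol = 1e-10)$root
  w <- exp(beta * values)
  w / sum(w)
}

faces <- 1:6
p_fair   <- maxent_given_mean(faces, 3.5)  # mean of a fair die
p_loaded <- maxent_given_mean(faces, 4.5)  # mean pulled towards high faces
print(round(p_fair, 4))
print(round(p_loaded, 4))
```

With a mean of 3.5 the constraint adds no information beyond what a fair die already implies, so the result is uniform. With a mean of 4.5 the probabilities tilt towards the higher faces, but only as much as the constraint requires.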
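# Packages used below (skip if already attached earlier in the study)
library(data.table)  # setDT
library(dplyr)       # %>%, as_tibble
library(purrr)       # map, set_names
library(terra)       # rast, as.data.frame for rasters
library(ggplot2)     # plotting
library(ggpubr)      # ggarrange
# Load the dataset saved in part 2 of the study
df <- readRDS("artifacts/final_data/final_data.rds")
# Define some global variables that will be referenced throughout the modeling
states <- sort(unique(df$state))
species <- sort(unique(df$common.name))
# Convert to data.table by reference
setDT(df)
# View output
as_tibble(df)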
# Load "Feature Engineered" rasters and original rasters into a
# single multi-layer raster by state
r.list <- set_names(states) %>%
  purrr::map(~ rast(c(
    paste0("data/final_rasters/", .x, ".tif"),
    file.path("artifacts/feature_engineered_final", paste0(.x, ".tif"))
  )))
# Create plots of NC Rasters as an example
nc.rast.plts <- purrr::map(names(r.list$NC), function(r.name) {
  # Convert the layer to a data frame with x/y coordinates for ggplot
  r.df <- terra::as.data.frame(r.list$NC[[r.name]], xy = TRUE)
  p <- ggplot(r.df, aes(x = x, y = y, fill = !!sym(r.name))) +
    geom_raster() +
    coord_cartesian()
  # Use a continuous viridis scale for every layer except NLCD_Land
  if (r.name != "NLCD_Land") p <- p + scale_fill_viridis_c()
  p + labs(title = r.name) + theme(legend.position = "none")
}) %>%
  ggarrange(plotlist = .,
            ncol = 4, nrow = ceiling(length(names(r.list$NC)) / 4)) +
  ggtitle("All Raster Layers for NC")
# View the plots of all of the NC rasters
print(nc.rast.plts)